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Abstract 

We discuss the scalable parallel solution of the Poisson equation within a Particle-In-Cell (PIC) code for the 
simulation of electron beams in particle accelerators of irregular shape. The problem is discretized by Finite 
Differences. Depending on the treatment of the Dirichlet boundary the resulting system of equations is sym- 
metric or 'mildly' nonsymmetric positive definite. In all cases, the system is solved by the preconditioned 
conjugate gradient algorithm with smoothed aggregation (SA) based algebraic multigrid (AMG) precondi- 
tioning. We investigate variants of the implementation of SA-AMG that lead to considerable improvements 
in the execution times. We demonstrate good scalability of the solver on distributed memory parallel pro- 
cessor with up to 2048 processors. We also compare our SAAMG-PCG solver with an FFT-based solver 
that is more commonly used for applications in beam dynamics. 

Keywords: Poisson equation, Irregular domains, Preconditioned conjugate gradient algorithm, Algebraic 
multigrid, Beam dynamics, Space-charge 



1. Introduction 

In recent years, precise beam dynamics simulations in the design of high-current low-energy hadron 
machines as well as of 4th generation light sources have become a very important research topic. Hadron 
machines are characterized by high currents and hence require excellent control of beam losses and/or keeping 
the emittance (a measure of the phase space) of the beam in narrow ranges. This is a challenging problem 
which requires the accurate modeling of the dynamics of a large ensemble of macro or real particles subject to 
complicated external focusing, accelerating and wake-fields, as well as the self-fields caused by the Coulomb 
interaction of the particles. In general the geometries of particle accelerators are large and complicated. The 
discretization of the computational domain is time dependent due to the relativistic nature of the problem, 
time and space dilatation. Both phenomenas have a direct impact on the numerical solution method. 

The solver described in this paper is part of a general accelerator modeling tool OPAL (Object Oriented 
Parallel Accelerator Library) [2J. OPAL allows to tackle the most challenging problems in the field of high 
precision particle accelerator modeling. These include the simulation of high power hadron accelerators and 
of next generation light sources. 

Some of the effects can be studied by using a low dimensional model, i.e., envelope equations [6ll2"T |l2"2"ll2"7] . 
These are a set of ordinary differential equations for the second-order moments of a time-dependent particle 
distribution. They can be calculated fast, however the level of detail is mostly insufficient for quantitative 
studies. Furthermore, a priori knowledge of critical beam parameters such as the emittance is required with 
the consequence that the envelope equations cannot be used as a self-consistent method. 
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One way to overcome these limitations is by considering the Vlasov-Poisson description of the phase 
space, including external and self-fields and, if needed, other effects such as wakes. To that end let /(x, v, t) 
be the density of the particles in the phase space, i.e., the position- velocity (x, v) space. Its evolution is 
determined by the collisionlcss Vlasov equation, 

| = a t / + v-V I / + i-(E + vxB)-V,/ = I (1) 

at mo 

where too, q denote particle mass and charge, respectively. The electric and magnetic fields E and B arc 
superpositions of external fields and self-fields (space charge) , 

E = E cxt + Egeif , B = B cxt + B se if. (2) 

If E and B are known, then each particle can be propagated according to the equation of motion for charged 
particles in an electromagnetic field, 

^= V) ^ = l(E + vxB). 
dt dt too 

After the movement of the particles Egejf and B se if have to be updated. To that end we change the 
coordinate system into the one moving with the particles. By means of the appropriate Lorentz transforma- 
tion 115] we arrive at a (quasi-) static approximation of the system in which the transformed magnetic field 
becomes negligible, B«0. The transformed electric field is obtained from 

E = E se if = -V4>, (3) 

where the electrostatic potential is the solution of the Poisson problem 

-A0(x) = M, (4) 

equipped with appropriate boundary conditions, see section [2] Here, p denotes the spatial charge density 
and e is the dielectric constant. By means of the inverse Lorentz transformation the electric field E can 
then be transformed back to yield both the electric and the magnetic fields in ph. 

The Poisson problem Q discretized by finite differences can efficiently be solved on a rectangular grid 
by a Particle-In-Cell (PIC) approach [20]. The right hand side in Q is discretized by sampling the particles 
at the grid points. In (|3j), is interpolated at the particle positions from its values at the grid points. We 
also note that the FFT-based Poisson solvers and similar approaches [THl [20] are restricted to box-shaped 
or open domains. 

Serafini et al. [24> report on a state-of-the-art conventional FFT-based algorithm for solving the Poisson 
equation with 'infinite-domain', i.e., open boundary conditions for large problems in accelerator modeling. 
The authors show improvements in both accuracy and performance, by combining several techniques: the 
method of local corrections, the James algorithm, and adaptive mesh refinement. 

However with the quest of high intensity, high brightness beams together with ultra low particle losses, 
there is a high demand to consider the true geometry of the beam-pipe in the numerical model. This assures 
that the image charge components are taken properly into account. This results in a more exact modeling 
of the non-linear beam dynamics which is indispensable for the next generation of particle accelerators. 

In a related paper by Poplau et al. |18j , an iterative solver preconditioned by geometric multigrid is used 
to calculate space-charge forces. The authors employ a mesh with adaptive spacings to reduce the workload 
of the BiCGStab solver used to solve the nonsymmetric system arising from quadratic extrapolation at the 
boundary. The geometric multigrid solver used in their approach is much more sensitive to anisotropic grids 
arising in beam dynamic simulations (e.g. special coarsening operators have to be defined). With smoothed 
aggregation-based algebraic multigrid (AMG) preconditioning as used in this paper the aggregation smoother 
takes care of anisotropies and related issues and leads to a robustness superior to geometric multigrid, 
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see [30] for a discussion. The preconditioner easily adapts to the elongation of the computational domain 
that happens during our simulation. 

In Section [2] we describe how the Poisson equation on a 'general' domain fl C M 3 can be solved by finite 
differences and the PIC approach. We treat the boundary in three different ways, by constant, by linear, 
and by quadratic extrapolation, the latter being similar to the approach of McCorquodale et at. [T7] . The 
system of equation is solved by the conjugate gradient algorithm preconditioned by smoothed aggregation 
AMG [SUES] j see Section [3] For this solver we coined the acronym: SAAMG-PCG. The preconditioned 
conjugate gradient (PCG) algorithm is also used if the system is 'mildly' nonsymmetric. In Section [4] we 
deal with details of the implementation, in particular its parallelization. In Section [5] we report on numerical 
experiments including a physical application from beam dynamics. In Section [6] we draw our conclusions. 

2. The discretization 

In this section we discuss the solution of the Poisson equation in a domain Vl C M 3 as indicated in 



r 2 




Figure 1: Sketch of a typical domain 



Figure [T] The boundary of the domain is composed of two parts, a curved, smooth surface Ti and two 
planar portions at z = —d and z = +d that form together r 2 . In physical terms Ti forms the casing of the 
pipe, while r 2 is the open boundary at the inlet and outlet of the beam pipe, respectively. The centroid of 
the particle bunch is at the origin of the coordinate system. The Poisson problem that we are going to solve 
is given by 

-Ac/) = — in 

6 = g = 0onT 1 , (5) 
36 1 , 

-^ + -^ = on r 2 . 
on a 

The parameter d in the Robin boundary condition denotes the distance of the charged particles to the 
boundary 18 . It is half the extent of f2 in z-direction. Notice that the Robin boundary condition applies 
only on the planar paraxial portions of the boundary. 

We discretize §5§ by a second order finite difference scheme defined on a rectangular lattice (grid) 

fl h :={xeS]ur 2 I Xi/hi G Z for i = 1, 2, 3} , 

where hi is the grid spacing and e^ the unit vector in the i-th coordinate direction. The grid is arranged 
in a way that the two portions of T 2 lie in grid planes. A lattice point is called an interior point if all its 
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direct neighbours are in f2. All other grid points are called near-boundary points. At interior points x we 
approximate Att(x) by the well-known 7-point difference star 
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-A h u{x) = 2_^ p . (6) 

i—l 1 

At grid points near the boundary we have to take the boundary conditions in ([5| into account. To explain 
the schemes on the Dirichlet (or PEC) boundary Ti let x be a near-boundary point. Let x' :— x — h^i for 
some i be outside f2 and let x* := x — s^e^, Q < s < 1, be the boundary point between x and x' that is 
closest to x, cf. Figure [2] If s = 1, i.e., if x' € 5f2 then w(x') in ([6| is replaced by the prescribed boundary 

X — sh 

• 1 • • 



XX X X 



Figure 2: 1-dimensional sketch of a near-boundary point x 

value. Otherwise, we proceed in one of the three following ways [5j [10] 

1. In constant extrapolation the boundary value prescribed at x — s^e.; € Ti is assigned to it(x'), 

u(x') =u(x-hiei) := g(x*). (7) 

2. In linear extrapolation the value at x' is obtained by means of the values u at x and at x — sh^i, 

u(x'):=(l-- s )u(x) + - s g(x*). (8) 

3. Quadratic extrapolation amounts to the Shortley-Weller approximation [51 1171 [25] . If x" := x + Zi^ej £ 
tth then the value u(x') is obtained by quadratic interpolation of the values of u at x, x", and the 
boundary point x* , 

u(x ') := fcil u(x) _ £zi u (x") + — ^— 9(x*). (9) 
s s+1 s(s + l) 

If x" 57;,, then let x** := x + i/ije,-, < t < 1, be the boundary point between x and x" that is 
closest to x. Then, similarly as before, we get 

. ,, (s-l)(i+l) , , (f+1) . t . (s-1) , 
u(x') := -u(x) + y r-g(x ) - 7 f o(x**), 

y ' st y ' s+t s 1 ' (s+t)r y h 

(10) 

+ (f-1) (s+1) , 

"(*):= 7 " x ~ 7 . ,x g( x ) + 7 , , u ff x • 

sr (s+t)s (s+t)t 

In all extrapolation formulae given above we set g(x*) = g(x**) = according to ([5]). The value on the 
right side of (|7|)-(|To| substitutes u(x±hiei) in 

Let us now look at a grid point x on the open boundary T 2 . If x is located on the inlet of the beam 
pipe then x" := x+h-^eT, G £1 and x' := x — ^1363 ^ f2. The Robin boundary condition is approximated by a 
central difference, 

u(x")-u(x') 1 . . 

or 

w(x')=u(x") pti(x). (11) 

a 

The same formula holds on the outlet boundary portion if x' denotes the virtual grid point outside fi. 

Notice that some lattice points may be close to the boundary with regard to more than one coordinate 
direction. Then, the procedures (|7|-([TT]) must be applied to all of them. 



The finite difference discretization just described leads to a system of equations 



Ax = b, (12) 

where x is the vector of unknown values of the potential and b is the vector of the charge density interpolated 
at the grid points. The Poisson matrix A is an M-matrix irrespective of the boundary treatment |10j . 
Constant and linear extrapolation lead to a symmetric positive definite A while quadratic extrapolation 
yields a nonsymmetric but still positive definite Poisson matrix. 

Notice that the boundary extrapolation introduces large diagonal elements in A if s or t gets close to 
zero. In order to avoid numerical difficulties it is advisable to scale the system matrix. If D = diagA, then 
we replace A in (12) by D~ x l 2 AD~ X I 2 and adapt x and b accordingly. 



3. The solution method 



In this section we discuss the solution of (12), the Poisson problem ^ discretized by finite differences 
as described in the previous section. 

3.1. The conjugate gradient algorithm 



The matrix A in (12) is symmetric positive definite (spd) if the boundary conditions are treated by 
constant or linear extrapolation. For symmetric positive definite systems, the conjugate gradient (CG) 
algorithm [10l [12] provides a fast and memory efficient solver. The CG algorithm minimizes the quadratic 
functional 

<p(x) = ^x T Ax - x T b (13) 

in the Krylov space that is implicitly constructed in the iteration. In the k-th iteration step the CG 
algorithm minimizes the quadratic functional (p along a search direction d^. The search directions turn out 
to by pairwise conjugate, d^Adj — for all k ^ j, and <£>(x) is minimized in the whole fc-dimcnsional Krylov 
space. 

If we use the quadratic extrapolation ([9| at the boundary then A in ( |12[ ) is not symmetric positive 



definite anymore. Nevertheless, the solution of (12) is still a minimizcr of </?(x). The CG algorithm can be 
used to solve (12). It is known to converge [8]. However, the finite termination property of CG is lost as the 
search directions are not mutually conjugate any more. Only consecutive search directions are conjugate, 
d^Adk-i = 0, reflecting the fact that y(x) is minimized only locally. Young & Jea [35] investigated 
generalizations of the conjugate gradient algorithm for nonsymmetric positive definite matrices, in which 
conjugacy is enforced among d^, . . . ,dk- s for some s > 1. We do not pursue these approaches here. As 
GMRES 23J, they consume much more memory space than the straightforward CG method that turned 
out to be extremely efficient for our application. Although A is nonsymmetric it is so only 'mildly', i.e., 
there are some deviations from symmetry only at some of the boundary points. Therefore, one may hope 
that the conjugate gradient method still performs reasonably well. This is what we actually did observe in 
our experiments. 

Methods that are almost as memory efficient as CG like, e.g., the stabilized biconjugate gradient 



(BiCGStab) method [34] could be used for solving (12), also. However, when considering computational 



costs we note that BiCGStab requires two matrix-vector products per iteration step, in contrast to CG that 
requires only one. 

3.2. Preconditioning 



To improve the convergence behavior of the CG methods we precondition (12). The preconditioned 
system has the form 

M~ x Ax = M _1 b, 

where the positive definite matrix M is the preconditioner. A good choice of the preconditioner reduces the 
condition of the system and thus the number of steps the iterative solver takes until convergence [SI [TO] ■ 
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Preconditioning is inevitable for systems originating in finite difference discretizations of 2nd order PDE's 
since their condition number increases as h~ 2 where h is the mesh width |16j . 

In this paper we are concerned with multilevel preconditioners. Multigrid or multilevel preconditioners 
are the most effective preconditioners, in particular for the Poisson problem 9, 3f . Multigrid methods make 
use of the observation that a smooth error on a fine grid can be well approximated on a coarser grid. When 
this coarser grid is chosen to be a sufficient factor smaller than the fine grid the resulting problem is smaller 
and thus cheaper to solve. We can continue coarsening the grid until we arrive at a problem size that can 
be solved cheaply by a direct solver. This observation suggests an algorithm similar to Algorithm [l] 

Algorithm 1 Multigrid V-cycle Algorithm 
1: procedure MultiGridSolve(^4£, b^, x^, £) 
2: if £ = maxLcvel — f then 
3: DircctSolve A^X( = h( 
4: else 

5: X£ <— S^ re (Ai, he, 0) {presmoothing} 
6: ri 4— he - Afx.g {calculate residual} 
7: b^ + i <— Rgrg {restriction} 

8: V £+1 <r- 

9: MultiGridSolve(^+i, b e+1 , v £+1 , £+1) 
10: x<> x^ + PfV(+i {coarse grid correction} 
li: x, <- S? ost (A t , h ( , x,) 
12: end if 
13: end procedure 



The procedure starts on the finest level (£ = 0) and repeatedly coarsens the grid until the coarsest level 
is reached (maxLevel) on which a direct solver is used to solve the problem at hand. On all other levels 
£ the algorithm starts by presmoothing S^ re the problem to damp high frequency components of the error 
(line 5). Subsequently the fine grid on level £ can be restricted with the restriction operator Re to a coarser 
grid on level £ + 1 (line 7) . This essentially "transfers" the low frequency components on the fine grid to high 
frequency components on the coarse grid. After the recursion has reached the coarsest level and used the 
direct solver to solve the coarse level problem the solution can be prolongated back to a finer grid. This is 
achieved with the prolongation operator Pe (line fO). Often a postsmoother S^ ost is used to remove artifacts 
caused by the prolongation operator. Usually these operators (for every level £) are defined in a setup phase 
preceding the execution of the actual multigrid algorithm. Lastly, Ae denotes the matrix of the discretized 
system in level £. 

The performance of multigrid methods profoundly depends on the choices and interplay of the smoothing 
and restriction operators. To ensure that the resulting preconditioner is symmetric we use the same pre- 
and postsmoother Sg and the restriction operator is chosen to be the transpose of the prolongation operator 
Re = Pj '■ This leaves us with two operators, Pe and Se, that have to be defined for every level. 

Prolongation Operator Pe- Aggregation based methods cluster the fine grid unknowns to aggregates (of a 
specific form, size, etc.) as representation for the unknowns on the coarse grid. First, each vertex of Ge , the 
adjacency graph of Ae, is assigned to one of the pairwise disjoint aggregates. Then, a tentative prolongation 
operator matrix is formed where matrix rows correspond to vertices and matrix columns to aggregates. A 
matrix entry has a value of 1 if the i th vertex is contained in j th aggregate and otherwise. This 

prolongation operator basically corresponds to a piecewise constant interpolation operation. To improve 
robustness one can additionally smooth the tentative prolongation operator. This is normally done with a 
damped Jacobi smoother. In general applying a smoother results in better interpolation properties opposed 
to the piecewise constant polynomials and improves convergence properties. Tuminaro & Tong |32| propose 
various strategies how to parallelize this process. The simplest strategy is to have each processor aggregate 
its portion of the grid. This method is called "decoupled" since the processors act independently of each 
other. Usually the aggregates are formed as cubes of 3 d vertices in d dimensions. Since the domains under 
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consideration are close to rectangular the decoupled scheme seems to be an appropriate strategy. In the 
interior of our domain we get regular cubes covering the specified number of vertices. Only a few aggregates 
near subgrid interfaces and domain boundary contain fewer vertices resulting in a non-optimal aggregate size. 
The overhead introduced is small. "Coupled" coarsening strategies, e.g. Parmetis, introduce interprocessor 
communication and are often needed in the presence of highly irregular domains. In our context applying 
uncoupled methods only restrict the size of the coarsest problem. This is due to the fact that on the coarsest 
level each processor must at least hold one degree of freedom. 

Smoothing operator Si. As advised in [1] we choose a Chebyshev polynomial smoother. The choice is moti- 
vated by the observation that polynomial smoothers perform better in parallel than Gauss-Seidel smoothers. 
Advantages are, e.g., that polynomial smoothers do not need special matrix kernels and formats for optimal 
performance and, generally, polynomial methods can profit of architecture optimized matrix vector prod- 
ucts. Nevertheless, routines are needed that yield bounds for the spectrum. But these are needed by the 
prolongator smoother anyway. 

Coarse level solver. The employed coarse level solver (Amesos-KLU) ships the coarse level problem to node 
and solves it there by means of an LU factorization. Once the solution has been calculated it is broadcast 
to all nodes. To gather and scatter data a substantial amount of communication is required. Moreover the 
actual solve can be expensive if the matrix possesses a large amount of nonzeros per row. 

An alternative is to apply a few steps of an iterative solver (e.g. Gauss-Seidel) at the coarsest level. 
A small number of iteration steps decreases the quality of the preconditioner and thus increases the PCG 
iteration count. A large number of iteration steps increases the time for applying the AMG preconditioner. 
We found three Gauss-Seidel iteration steps to be a good choice for our application. 

Cycling method. We observed a tendency that timings for the W-cycle are 10% - 20% slower compared with 
the V-cycle. 

4. Implementation details 

The multigrid preconditioner and iterative solver are implemented with the help of the Trilinos frame- 
work |TTJ[5S]. Trilinos provides state-of-the-art tools for numerical computation in various packages. Aztec, 
e.g., provides iterative solvers and ML |7J multilevel preconditioners. By means of ML, we created our 
smoothed aggregation-based AMG preconditioner. The essential parameters of the preconditioner discussed 
above are listed in Table [TJ 



name 


value 


preconditioner type 


MGV 


smoother 


pre and post 


smoother type 


Chebyshev 


aggregation type 


Uncoupled 


coarse level solver 


Amesos-KLU 


maximal coarse level size 


1000 



Table 1: Parameters for multilevel preconditioner ML. 



To embed the solver in the physical simulation code (OPAL [2]) we utilized the Independent Parallel 
Particle Layer (IP 2 L 3 ). This library is an object-oriented framework for particle based applications 
in computational science designed for the use on high-performance parallel computers. In the context of 
this paper IP 2 L is only relevant because OPAL uses IP 2 L to represent and interpolate the particles at 
grid points with a charge conserving Cloud-in-Cell area weighting scheme. ML requires an Epetra_Map 
handling the parallel decomposition to create parallel distributed matrices and vectors. To avoid additional 
communication the Epetra_Map and the IP 2 L field are determined to have the same parallel decomposition. 
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In this special case the task of converting the IP 2 L field to an Epetra_Vector is as simple as looping over 
local indices and assigning values. 

A particle based domain decomposition technique based on recursive coordinate bisection is used (see 
Section 5.4 1 to parallelize the computation on a distributed memory environment. One, two and three- 
dimensional decompositions are available. For problems of beam dynamics with highly nonuniform and time 
dependent particle distributions, a dynamic load balancing is necessary to preserve the parallel efficiency of 
the particle integration scheme. Here, we rely on the fact that IP 2 L attains a good load balance of the data. 

We use the solution of one time step as the initial guess for the next time step. 



4-1- Domains 

The simplest domains under consideration are regular, rectangular domains. This domains are used by 
the FFT solver with so-called open-space boundary conditions as described in [T5] . 

Our SAAMG-PCG solver is not restricted to rectangular domains. It can handle irregular domains as 
the ones introduced in the next section. 



Non-rectangular domains 

To properly handle emerging irregular domains we implemented an abstract class providing an interface 
to query the discretization near the boundary. Every implementation of an irregular domain has to identify 
boundary points and provide the stencil for near-boundary points given one of the extrapolation schemes 
discussed in Section [2] Boundary points are stored in STL containers. Essentially the coordinate value of a 
gridline is mapped to its intersection values, providing a fast look-up table for a given gridlinc. 

In this work we use the SAAMG-PCG solver mainly for cylindrical domains with an elliptic base area. 
These domains can be characterized by means of two parameters: the semi-major and semi-minor axis. 
We compute the intersection points of the grid with the elliptical domain boundary by using its implicit 
representation and subsequently store them into a STL container. These intersections have to be recomputed 
whenever the parameters of the ellipse or the mesh spacings change. 



5. Numerical Experiments and Results 

In this section we discuss various numerical experiments and results concerning different variants of the 
preconditioner and comparisons of solvers and boundary extrapolation methods. Unless otherwise stated 
the measurements are done on a tube embedded in a rectangular equidistant 256 x 256 x 256 grid. This is 
a common problem size in beam dynamics simulations. Most of the computations were performed on the 
Cray XT4 cluster of the Swiss Supercomputing Center (CSCS) in Manno, Switzerland. Computations up to 
512 cores were conducted on the XT4 cluster (Buin) with 468 AMD dual core Opteron 2.6 GHz processors 
and a total of 936 GB DDR RAM on a 7.6 GB/s interconnect bandwith network. Larger computations were 
performed on the XT3 cluster (Palu) with 1692 AMD dual core Opteron 2.6 GHz processors and a total of 
3552 GB DDR RAM interconnected with a 6.4 GB/s interconnect bandwith network. 

Throughout this section we will report the timings of portions of the code as listed in Table [2j 

name description 

construction time for constructing the ML hierarchy 

application time for applying the ML preconditioner 

total ML total time used by ML (= construction + application) 

solution time needed by the iterative solver 

Table 2: Description of various timings used. 
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5.1. Comparison of Extrapolation Schemes 

For validation and comparison purposes we applied our solver to a problem with a known analytical 
solution. We solved the Poisson problem with homogeneous Dirichlet boundary conditions (</> = 0) on a 
cylindrical domain fi = {|r|<^}x(— The axisymmetric charge density 

p = - ^7r 2 r 2 -^-A\ sin(7r(z - 0.5)) 
gives rise to the potential distribution 

<f>(r, 6,z)=(\- r 2 ) sin(7r(z - 0.5)). 



The charge density in our test problem is smoother than in real problems. Nevertheless, it is very small 
close to the boundary. This reflects the situation in particle accelerators where most particles are close to 
the axis of rotation and, thus, charge densities are very small close to the boundary. 
We measure the error on the grid f2/j with mesh spacing h in the discrete norms 



\\e h h = \\h-<t>h= h? £ \{kh-<Pr)\ 2 , 
Halloo = \\4>h - 4>\\ao = max \4> i h - 4>i\, 

where 4>h is the approximation of the solution (f> on O/j, and e/j denotes the corresponding error. The 
convergence rate is approximately 

"l e 2fc| 



r = log 2 



eh 



h 


IMIa 


r 


\\ e h\\oo r 


||eh||oo/||0| 


oo 


1/64 
1/128 


2.162 x 10~ 3 
1.240 x 10" 3 


0.80 


7.647 x 10~ 3 
4.153 x 10" 3 0.88 


3.061 x 10" 
1.662 x 10" 


-2 
-2 




Table 3: Solution 


error 


for constant extrapolation, d = 3. 




h 


l|e fc || a 


r 


|e/i||oo r 


ll e ^l|oo/||'/'| 


oo 


1/64 
1/128 


2.460 x 10~ 5 
6.226 x 10" 6 


1.98 


6.020 x 10~ 5 
1.437 x 10" 5 2.07 


2.410 x 10 
5.751 x 10" 


-4 
-5 




Table 4: Solution error for linear extrapolation, c 


1=3. 




h 


\\thh 


r 


||e/, [|oo r 


I|e/»|U/||<M 


oo 


1/64 
1/128 


5.581 x 10~ 6 
1.384 x 10" 7 


2.01 


1.689 x 10" 5 
4.550 x 10" 6 1.89 


6.761 x 10- 
1.820 x 10- 


-5 
-5 



Table 5: Solution error for quadratic extrapolation, d = 3. 



We solved the Poisson equation with the boundary extrapolation methods introduced earlier. The errors 
are listed in Tables [3J|5] The numbers confirm the expected convergence rates, i.e., linear for the constant 
extrapolation and quadratic for the linear and quadratic extrapolation [14] - The results obtained with the 
linear extrapolation scheme are more accurate than constant extrapolation by two orders of magnitude. 
Quadratic extrapolation is again more accurate than linear extrapolation, but for both norms by only a 
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factor 3 to 5. It evidently does not make sense to use constant extrapolation as the cost of solving with 
linear boundary extrapolation is equal. In contrast, the quadratic boundary treatment entails the drawback 
that discretization matrices lose symmetry. They are still positive definite, however. In the particular setting 
of this test problem as well as in others we have investigated (e.g. in real particle simulations) the system 
matrices were just 'mildly' nonsymmetric such that PCG could be applied savely and without performance 
loss. 

5.2. ML variations 

Multilevel preconditioners are highly sophisticated preconditioners. Not surprisingly, their construction 
is very time consuming. To build an SA-AMG preconditioner (1) the "grid" hierarchy (including aggregation 
and construction of tentative prolongator) , (2) the final grid transfer operators (smoothing the prolongators), 
and (3) the coarse grid solver have to be set up. 

In the following subsections we investigate various variants of the preconditioner or more precisely variants 
of the construction of the preconditioner when solving a sequence of related Poisson problems. 

The default variant builds a new preconditioner in every time step. In the sequel we will investigate how 
costly this is. Other variants reuse portions of previous preconditioners. 

We compare with the FFT-based Poisson solver [T3] that OPAL (version 1.1.5) provides for open-space 
boundary conditions. The FFT kernel is based on a variant of the FFTPACK library [41 12"8]. 

Reusing the aggregation hierarchy 

Since the geometry often changes only slowly in time, the topology of the computational grid does not or 
only rarely alter. Therefore, it makes sense to reuse the aggregation hierarchy and in particular the tentative 
prolongators for some or all iterations. Only smoothed prolongators and other components of the hierarchy, 
like smoothers and coarse solver, are recomputed [7\ p. 16]. This leads to a preconditioner variation in which 
the aggregation hierarchy is kept as long as possible. The numbers in Table [6] show that this minor change 
in the set up phase reduces construction times by approximately 30%. 



cores 


average of 10 


one 




time steps (s) 


time step (s) 


16 


6.98 


10.3 


32 


4.36 


6.44 


64 


2.38 


3.48 


128 


1.34 


1.91 


256 


0.735 


1.04 


512 


0.518 


0.745 



Table 6: Preconditioner construction times. Left: average cost of 10 time steps reusing the hierarchy of the first iteration step. 
Right: cost of a time step when building the whole preconditioner. 

Reusing the aggregation hierarchy is a feature provided by ML. It is intended for use in nonlinear systems 
solving. In our simulations it reduced the time per AMG solve in a simulation run by approximately 25%, 
see Table 

Reusing the preconditioner 

We can be more aggressive by reusing the preconditioner of the first iteration throughout a whole simu- 
lation. Although the iteration count increased, the time-to-solution reduced considerably. To counteract an 
excessive increase of the iteration steps the preconditioner can be recomputed once the number of iterations 
exceeds a certain threshold. (This was not necessary in our experiments, though.) 

Applying this approach to a cylinder-shaped beam pipe, a single preconditioner could be used throughout 
the entire simulation without any severe impact on the number of iteration steps. 
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(a) (b) 

Figure 3: Sketch of the test cases with equal number of mesh points (a), and equal mesh resolution (b), respectively. Displayed 
are the shared (square), FFT only (triangle), and AMG only (filled circle) mesh points on a cross section of the grid plane. 
Illustrative particles (gray) inside the FFT domain denote the charge density. 



In the following we compare accuracy and performance of the SA-AMG preconditioned conjugate gradient 
algorithm with a FFT-based solver. The principle difficulty in the comparison stems from the fact that the 
CG-based and the FFT-based solvers use different discretizations of the tube-shaped domain ft. Although 
both use finite-difference discretizations, with SA-AMG the whole domain is discretized while with FFT 
just a rectangular domain along the center line of the cylinder is taken into account. In some applications 
(beam pipes of light sources) the rectangular domain is contained in f2. In our application the rectangular 
domain has a similar volume such that the boundaries can be intertwined. Therefore, we expect that the 
more accurate boundary treatment in the iterative solver has a noticeable positive effect on the solution. 

So, we came up with two test cases as illustrated in Figure [3] 

• The first test case (a) displayed on the left corresponds to a situation where both methods have about 
the same number of unknowns. In the FFT-based approach only the close vicinity of the particle 
bunch is discretized. In contrast, in the PCG-AMG approach the grid extends to the whole domain, 
entailing a coarser grid than with the FFT-based approach. 

• The second test case (b) displayed on the right corresponds to a situation where both methods have 
the same mesh resolution in the vicinity of the particles. This results in a higher number of mesh 
points for the PCG-AMG approach. 

We consider a cylindrical tube of radius r — 0.001 m. The FFT-based solver used a grid with 128 2 • 256 = 
4, 194, 304 nodes. The grid with a similar number of points but coarser resolution consisted of 3,236,864 
grid points. To obtained the same resolution with the SAAMG-PCG solver, 5,462,016 grid points were 
required. They are embedded in a 166 x 166 x 256 grid that coincides in the middle of the region of simulation 
with the 128 x 128 x 256 grid for the FFT-based solver. The boundary conditions were implemented by 
linear extrapolation ([8]). In Table [7] we give execution times for the first and second iteration in a typical 
simulation. Since we start with quite a good vector the number of steps in the second iteration is about 
30% smaller than in the first iteration. This accounts for the reduced execution time in the runs where the 
complete preconditioners are recomputed for each iteration. The savings from this straightforward mode to 
the two cases reusing either hierarchy or the entire preconditioner amounts to approximately 20% and 40% 
respectively. 

We inspected the SAAMG-PCG results for the two mesh sizes and found no physically relevant differ- 
ences. This behaviour depends on the ratio of bunch and boundary radius. Based on the differences of the 
FFT and SAAMG-PCG solver (e.g. boundary treatment and implementation details) it is hard to come 
up with a 'fair' comparison. There certainly exists a correlation between the number of performed (CG) 
iterations and the time to solution. Determining the right stopping criteria and tolerance therefore has an 
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solver 


reusing 


mesh size 


mesh points 


first [s] 


second [s] 


FFT 




128 x 128 x 256 


4,194,304 


12.3 




AMG 




128 x 128 x 256 


3,236,864 


49.9 


42.2 


AMG 


hierarchy 


128 x 128 x 256 


3,236,864 




35.5 


AMG 


preconditioncr 


128 x 128 x 256 


3,236,864 




28.2 


AMG 




166 x 166 x 256 


5,462,016 


81.8 


71.2 


AMG 


hierarchy 


166 x 166 x 256 


5,462,016 




60.4 


AMG 


preconditioncr 


166 x 166 x 256 


5,462,016 




43.8 



Table 7: Simulation timings of one solve in the first and second time step, respectively, with r = 0.001 m. Equal mesh points 
(above) and equal mesh spacings (below) for FFT and AMG. 



important impact on the performance. While still achieving the same accuracy of the physics of a simulation 
it could be possible to execute fewer CG iterations by using a higher tolerance. For the measurements in 
Table [7] we used the stopping criterion 

IH| a <e||6||a, 

with the tolerance e = 10 -6 . 

These results illustrate that an increase in solution accuracy of approximately 2.3 in the best case (when 
the domain has irregularities) is incurred in moving from an FFT-based scheme to a more versatile approach. 
Of course, this more versatile approach gives rise to increased accuracy. 

Coarse level solver 

Another decisive portion of the AMG preconditioner is the coarse level solver. We applied either a direct 
solver (KLU) or used a couple of steps of Gauss-Seidel iteration. Our experiments indicate that for our 
problems ML setup time and scalability are not affected significantly by the two coarse level solvers. The 
difference in setup and application of KLU and Gauss-Seidel differ only within a few percent. Only the 
construction of the preconditioner is cheaper with the iterative coarse level solver. 

As stated by Tuminaro & Tong [35] the number of iterations done by the iterative coarse level solver is 
crucial for the performance of the preconditioner. Too many iteration steps slow down the preconditioner 
without a corresponding increase of its quality. Too few iteration steps with the coarse level solver degrade 
the quality of the overall preconditioner and lead to an increased number of steps of the global system solver. 
We tuned the iterative coarse level solver such that the overall quality of the preconditioner was about the 
same as with the direct coarse level solver. It turned out that 3 iteration steps sufficed. 

Coarse level size 

We also tested the performance of the solver for varying sizes of the coarsest level. ML seems to perform 
best when the coarsest grid size is around 1000. With a limit of 1000, the coarsest grid sizes ranged from 
128 to 849 when running a tube embedded in a 256 x 256 x 256 grid on 16 to 512 cores. 

At the same time we tried to set the size of the coarsest level proportional to the total available number 
of cores in order to get a sufficient large coarse level size. It turned out to be very difficult to set a coarse 
level size with heuristics like this. The factorization time increased to up to 2 s in contrast to 0.25 s for the 
case where the coarsest level size is limited by 1000. 

5.3. Speedup and efficiency 

The efficiency of the AMG solver using linear boundary extrapolation is shown in Figure |4j On the left of 
this figure, the results listed in Table[8]for a 256 3 grid are plotted. We observe an efficiency of approximately 
62% for 256 cores relative to the timing on 16 processors, the minimal number of processors to solve the 
problem. The efficiency dropped just below 50% for 512 cores. The parallel efficiency is affected most by the 
poor scalability of the construction phase of ML. After studying various ML timings we could not identify a 
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Figure 4: Efficiency for AMG on a tube embedded in a 256 x 256 x 256 grid (left) and in a 512 x 512 x 512 grid (right) with 
constant extrapolation at the boundary. 



cores 


solution [s] 


construction [s] 


application [s] 


total ML [s] 


16 


10.91 


7.75 


9.52 


17.28 


32 


6.66 


4.93 


5.68 


10.61 


G4 


3.82 


2.87 


3.19 


6.06 


128 


1.95 


1.51 


1.64 


3.15 


256 


1.15 


0.93 


0.94 


1.88 


512 


0.71 


0.83 


0.61 


1.44 



Table 8: Timings for AMG on a tube embedded in a 256 X 256 X 256 grid with linear extrapolation at the boundary. 



specific reason that causes this loss in efficiency for the construction phase other than the assumption that 
the problem is too small with respect to the number of cores. 

Similar conclusions can be drawn for the parallel efficiency of our solver on a 512 3 grid with constant 
boundary treatment, see Tableland Figure [3] Again the construction phase of ML scales poorly for larger 
numbers of cores. 



Notice that by applying the improvements discussed in Section 5.2 i.e., reusing (parts of) the precondi 



tioner, the time needed for the construction phase can be reduced significantly or avoided altogether. If the 
preconditioner has to be built just once in an entire simulation the efficiency will get close to the 52% that 
we measured for the solution phase. 

Finally, in Table 10 we report on timings obtained for the tube embedded in a 1024 x 1024 x 1024 grid. 
In Figure[5]the corresponding efficiencies are listed. For this large problem we observe good efficiencies. The 
solver runs at 82% efficiency with 2048 cores relative to the 512-cores performance. The construction phase 
is still performing the worst with an efficiency of 73%. In this setup the problem size is still reasonably large 
when employing 2048 cores. This consolidates our understanding of the influence of the problem size on the 



cores 


solution [s] 


construction [s] 


application [s] 


total ML [s] 


256 


10.51 


5.79 


8.70 


14.49 


512 


5.03 


3.03 


4.16 


7.19 


1024 


2.90 


2.31 


2.45 


4.76 


2048 


2.54 


7.58 


2.27 


9.85 



Table 9: Timings for AMG on a tube embedded in a 512 X 512 X 512 grid with constant extrapolation at the boundary. 
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Figure 5: Efficiency for AMG on a tube embedded in a 1024 X 1024 X 1024 grid with linear extrapolation at the boundary, 
corresponding to the numbers in Table [10"| 



cores 


solution [s] 


construction [s] 


application [s] 


total ML [s] 


512 


35.83 


20.78 


29.53 


50.31 


1024 


18.87 


11.80 


15.65 


27.46 


2048 


10.93 


6.68 


9.25 


15.93 



Table 10: Timings for AMG on a tube embedded in a 1024 X 1024 X 1024 grid with linear extrapolation at the boundary. 

Tables |8]jl0| provide a few data to investigate weak scability. The timings for 2048 processors in Table [T0| 
should ideally equal those for 256 processors in Table [9j In fact, they are quite close. A comparison of the 
timings in Tables [9] and [8] is not so favorable. The efficiency is at most 84%. The numbers for 2048 processors 
in Table [9] show that the construction of the multilevel preconditioner becomes excessively expensive if the 
number of processors is high. 

5.4- Open-space vs. PEC Boundary Conditions 

In this section we compare the impact of two different boundary conditions in the setting of a physical 
simulation consisting of an electron source (4 MeV) followed by a beam transport section [25] . As the pipe 
radius gets close to the particles in the beam, the fields become nonlinear due to the image charge on the 
pipe. We compare the root-mean-square (rms) beam size in a field free region (drift) of a convergent beam, 
cf. [351 pp-171ff]. The beam pipe radius is r = 0.00085 m in case of the PGC-MG solver. This is an extreme 
case in which the particles fill almost the whole beam pipe and hence the effect is very visible. The capability 
to have an exact representation of the fields near the boundaries is very important, because the beam pipe 
radius is an important optimization quantity, towards lower construction and operational costs in the design 
and operation of future particle accelerators. 

In Figure [6] we compare rms beam sizes for the two boundary conditions applied to the boundary of a 
cylinder with elliptical base-area as described in Section [2] 

The differences are up to 40% (at z — 1.435 m) in rms beam size, when comparing the PEC and the 
open-space approach. We clearly see the shift of the beam size minimum (waste) towards larger z values 
and a smaller minima, which means that the force of the self fields are larger when considering the beam 
pipe. This increase in accuracy justifies the accurate boundary treatment, in situations where the spatial 
extent of the beam is comparable with that of the beam pipe. 
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Figure 6: Comparison of rms beam size vs. position of a convergent beam. The FFT solver is applied for open-space boundary 
conditions, for PEC two variants of the SA AMG-PCG solver is used: with linear and quadratic boundary interpolation, (region 
of interest magnified). The computational domain Q is a cylinder with r = 0.00085 m . 



6. Conclusion 

We have presented a scalable Poisson solver suitable to handle domains with irregular boundaries as 
they arise, for example, in beam dynamics simulations. The solver employs the conjugate gradient algo- 
rithm preconditioned by smoothed aggregation based AMG (SAAMG-PCG). PEC and open-space boundary 
approximations have been discussed. A real world example where the solver was used in a beam dynamics 
code (OPAL) shows the relevance of this approach by observing up to 40% difference in the RMS beam size 
when comparing to the FFT-based solver with open domains. The code exhibits excellent scalability up to 
2048 processors with cylindrial tubes embedded in meshes with up to 1024 3 grid points. In the very near 
future, this approach will enable precise beam dynamics simulations in large particle accelerator structures 
with a level of detail not obtained before. 

In real particle simulations (and other test cases encountered) system matrices arising from quadratic 
boundary treatment are only 'mildly' nonsymmetric such that PCG can be applied. 

Planed future work includes adaptive mesh refinement in order to reduce the number of grid points in 
regions that are less relevant for the space charge calculation. The boundary treatment for simply connected 
geometries will be extended to cope with more realistic geometries. This new method, designed for accurate 
3D space charge calculations, will be used in the beam dynamics simulations for the SwissFEL project, a 
next generation light source foreseen to be built in Switzerland. 
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